47 research outputs found
A framework for multidimensional indexes on distributed and highly-available data stores
Non-relational databases are nowadays a common solution when dealing with huge data sets and massive query workloads. These systems have been redesigned from scratch in order to achieve scalability and availability, at the cost of providing only a reduced set of low-level functionality, thus forcing the client application to implement complex logic. As a solution, our research group developed Hecuba, a set of tools and interfaces that aims to provide developers with an efficient and painless interaction with non-relational technologies.
This paper presents the part of Hecuba related to a particular missing feature: multidimensional indexing. Our work focuses on the design of architectures and algorithms for providing multidimensional indexing on a distributed database without compromising scalability and availability.
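The abstract does not reproduce the paper's concrete architectures or algorithms. A common building block for multidimensional indexing over a key-ordered, range-partitioned store (offered here only as an illustrative sketch, not as the paper's method) is a space-filling curve that maps d-dimensional points onto one-dimensional keys, so that nearby points tend to land on the same node:

```python
def interleave(coords, bits=16):
    """Z-order (Morton) encoding: interleave the bits of each
    coordinate so nearby points map to nearby 1-D keys."""
    key = 0
    for bit in range(bits):
        for dim, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * len(coords) + dim)
    return key

# A 2-D point becomes a single sortable key that a distributed
# key-value store can range-partition across nodes.
print(interleave((0, 0)))  # 0
print(interleave((1, 1)))  # 3
print(interleave((2, 3)))  # 14
```

Range queries over a rectangle can then be answered by scanning a small number of contiguous key ranges on the curve.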
Improving the performance of Java applications through cooperation between the operating system and the Java virtual machine
The use of virtualized execution environments has spread to every field and, in particular, they are being used to develop and run resource-intensive applications. It therefore becomes necessary to evaluate whether these platforms offer adequate performance for this kind of program, and whether their characteristics can be exploited to favour its execution. The main goal of this work has been to show that it is possible to exploit the characteristics of virtualized execution environments to offer programs a resource management better suited to their behaviour. We show that the execution model of these environments, based on running on top of virtual machines, offers a new opportunity to implement program-specific resource management that improves performance without giving up the many advantages of such platforms, for example, full code portability.
To demonstrate the benefits of this strategy we selected, as a case study, memory management for scientific computing programs in the Java runtime environment. After a detailed analysis of the influence of memory management on this kind of program, we found that adding to the runtime a page-prefetching policy that adapts to program behaviour is a promising way to improve performance. We therefore analysed in detail the requirements such a policy must satisfy, and how to divide the work among the components of the Java runtime environment to meet them. As a result, we designed a prefetch policy based on cooperation between the virtual machine and the operating system.
In our proposal, on the one hand, prefetch decisions are made using all the knowledge the virtual machine has about the dynamic behaviour of programs, together with the knowledge the operating system has about the execution conditions. On the other hand, the operating system is responsible for carrying out the management decisions, which guarantees the reliability of the machine. Moreover, this strategy is fully transparent to the programmer and the user, preserving the portability paradigm of virtualized execution environments. We implemented and evaluated this strategy to demonstrate its benefits for the selected kind of program and, although these benefits depend on the characteristics of each program, the performance improvement reached up to 40% compared with the original runtime environment.
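The division of labour described above, a runtime-level component that predicts future accesses and an OS-level component that performs the actual fetches, can be simulated with a toy stride detector. All names and the policy itself are illustrative assumptions, not the thesis's actual implementation:

```python
class StridePrefetcher:
    """Toy sketch: the runtime side (this predictor) watches the
    program's page accesses; the OS side would actually fetch the
    pages it hints at (e.g. via a WILLNEED-style advice call)."""
    def __init__(self, depth=2):
        self.depth = depth      # how many pages ahead to request
        self.last = None
        self.stride = None

    def access(self, page):
        hints = []
        if self.last is not None:
            stride = page - self.last
            if stride == self.stride and stride != 0:
                # Confirmed regular pattern: hint the next pages.
                hints = [page + stride * i for i in range(1, self.depth + 1)]
            self.stride = stride
        self.last = page
        return hints

p = StridePrefetcher()
for pg in (10, 11, 12, 13):
    print(pg, p.access(pg))   # hints appear once the stride repeats
```

The split mirrors the proposal: decisions use the runtime's knowledge of program behaviour, while executing them stays with the operating system.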
Hecuba: NoSql made easy
Non-relational databases are nowadays a common solution when dealing with huge data sets and massive query workloads. These systems have been redesigned from scratch in order to achieve scalability and availability, at the cost of providing only a reduced set of low-level functionality, thus forcing the client application to take care of complex logic. As a solution, our research group developed Hecuba, a set of tools and interfaces that aims to provide programmers with an efficient and easy interaction with non-relational technologies.
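The abstract does not show Hecuba's actual interface, but the kind of abstraction such a layer provides can be sketched hypothetically: the application keeps plain dictionary syntax while an adapter hides the non-relational store (here an in-memory stub stands in for a real backend such as Cassandra; all names are illustrative, not Hecuba's API):

```python
class PersistentDict:
    """Hypothetical sketch of a Hecuba-style abstraction: dict
    syntax on top, a pluggable store adapter underneath."""
    def __init__(self, backend):
        self._backend = backend              # e.g. a Cassandra adapter

    def __setitem__(self, key, value):
        self._backend.put(str(key), value)   # one low-level write

    def __getitem__(self, key):
        return self._backend.get(str(key))

class InMemoryBackend:
    """Stand-in for a distributed key-value store."""
    def __init__(self):
        self._rows = {}
    def put(self, k, v):
        self._rows[k] = v
    def get(self, k):
        return self._rows[k]

d = PersistentDict(InMemoryBackend())
d["particle:42"] = (0.5, 1.2, -3.0)
print(d["particle:42"])  # (0.5, 1.2, -3.0)
```

The point is the one the abstract makes: the client code stays simple because the complex low-level logic lives behind the interface.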
Introducing polyglot-based data-flow awareness to time-series data stores
The rising interest in extracting value from data has led to a broad proliferation of monitoring infrastructures, most notably composed of sensors, intended to collect this new oil. Thus, gathering data has become fundamental for a great number of applications, such as predictive maintenance techniques or anomaly detection algorithms. However, before data can be refined into insights and knowledge, it has to be efficiently stored and prepared for its later retrieval. As a consequence of this sensor and IoT boom, time-series databases (TSDBs), designed to manage sensor data, have been the fastest-growing database category since 2019. Here we propose a holistic approach intended to improve TSDB performance and efficiency. More precisely, we introduce and evaluate a novel polyglot-based approach aimed at tailoring the data store not only to time-series data, as is done conventionally, but also to the data flow itself, from ingestion to retrieval. In order to evaluate the approach, we materialize it in an alternative implementation of NagareDB, a resource-efficient time-series database based on MongoDB, in turn the most popular NoSQL storage solution. After implementing our approach in the database, we observe a global speed-up, solving queries up to 12 times faster than MongoDB's recently launched time-series capability, as well as generally outperforming InfluxDB, the most popular time-series database. Our polyglot-based data-flow-aware solution can ingest data more than two times faster than MongoDB, InfluxDB, and NagareDB's original implementation, while using the same disk space as InfluxDB and half of that requested by MongoDB. This research was partly supported by the Spanish Ministry of Science and Innovation (contract PID2019-107255GB) and by the Generalitat de Catalunya (contract 2017-SGR-1414).
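One way to picture "tailoring the store to the data flow" (a deliberately simplified sketch under assumed behaviour, not NagareDB's actual design) is to keep recent samples in a write-optimised buffer and roll older samples into a read-optimised, downsampled representation:

```python
class TieredSeriesStore:
    """Illustrative data-flow-aware layout: raw recent data for
    fast ingestion, averaged buckets for cheap historical reads."""
    def __init__(self, hot_window):
        self.hot_window = hot_window   # seconds kept raw
        self.hot = []                  # (timestamp, value) pairs
        self.cold = []                 # (bucket_start, mean) pairs

    def ingest(self, ts, value):
        self.hot.append((ts, value))   # append-only, write-optimised

    def roll(self, now, bucket=60):
        """Move samples older than hot_window into averaged buckets."""
        old = [(t, v) for t, v in self.hot if now - t > self.hot_window]
        self.hot = [(t, v) for t, v in self.hot if now - t <= self.hot_window]
        buckets = {}
        for t, v in old:
            buckets.setdefault(t - t % bucket, []).append(v)
        for start in sorted(buckets):
            vals = buckets[start]
            self.cold.append((start, sum(vals) / len(vals)))

store = TieredSeriesStore(hot_window=120)
for ts, v in [(0, 10.0), (30, 20.0), (200, 5.0)]:
    store.ingest(ts, v)
store.roll(now=240)
print(store.cold)   # [(0, 15.0)]  -- samples at t=0 and t=30 averaged
print(store.hot)    # [(200, 5.0)]
```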
A holistic scalability strategy for time series databases following cascading polyglot persistence
Time-series databases aim to handle large amounts of data quickly, both when introducing new data into the system and when retrieving it later on. However, depending on the scenario in which these databases participate, reducing the number of requested resources becomes a further requirement. Following this goal, NagareDB and its Cascading Polyglot Persistence approach were born: they were intended not just to provide a fast time-series solution, but also to strike a good cost-efficiency balance. However, although they provided outstanding results, they lacked a natural way of scaling out in a cluster fashion. Consequently, monolithic deployments could extract the maximum value from the solution, but distributed ones had to rely on general scalability approaches. In this research, we propose a holistic approach specially tailored to databases following Cascading Polyglot Persistence, to further maximize their inherent resource-saving goals. The proposed approach reduced the cluster size by 33% in a setup with just three ingestion nodes, and by up to 50% in a setup with 10 ingestion nodes. Moreover, the evaluation shows that our scaling method provides efficient cluster growth, offering scalability speedups greater than 85% of a theoretically perfect 100% scaling, while also ensuring data safety via data replication. This research was partly supported by the Grant Agreement No. 857191, by the Spanish Ministry of Science and Innovation (contract PID2019-107255GB) and by the Generalitat de Catalunya (contract 2017-SGR-1414).
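The "greater than 85% of perfect scaling" figure is a standard parallel-efficiency measure; the numbers below are illustrative, not taken from the paper:

```python
def scaling_efficiency(t_single, t_cluster, nodes):
    """Fraction of ideal (linear) speedup actually achieved:
    ideal speedup with n nodes is n, so efficiency = speedup / n."""
    speedup = t_single / t_cluster
    return speedup / nodes

# Illustrative numbers: a job taking 90 s on one node and 12 s on
# 8 nodes achieves a 7.5x speedup out of an ideal 8x.
print(scaling_efficiency(90, 12, 8))  # 0.9375, i.e. ~94% of perfect scaling
```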
A compromise archive platform for monitoring infrastructures
The great advancement in the technological field has led to an explosion in the amount of generated data. Many different sectors have understood the opportunity that acquiring, storing, and analyzing further information represents, which has led to a broad proliferation of measurement devices. Those sensors' typical job is to monitor the state of the enterprise ecosystem, which can range from a traditional factory, to a commercial mall, or even to the largest experiment on Earth [1].
Big enterprises (BEs) are building their own big data architectures, usually made out of a combination of several state-of-the-art technologies, and finding new interesting data to measure, store, and analyze has become a daily process in the industrial field. However, small and medium-sized enterprises (SMEs) usually lack the resources needed to build those data-handling architectures, not just in terms of hardware, but also in terms of hiring personnel who can master all those rapidly evolving technologies.
Our research adapts two widely used technologies into a single, elastic, and moldable one, by tuning them, to offer an alternative and efficient solution for this very specific, but common, scenario.
Evaluating the benefits of key-value databases for scientific applications
The convergence of Big Data applications with High-Performance Computing requires new methodologies to store, manage, and process large amounts of information. Traditional storage solutions are unable to scale, which results in complex coding strategies. For example, the brain atlas of the Human Brain Project faces the challenge of processing large amounts of high-resolution brain images. Given the computing needs, we study the effects of replacing a traditional storage system with a distributed key-value database in a cell segmentation application. The original code uses HDF5 files on GPFS through an intricate interface, imposing synchronizations. On the other hand, by using Apache Cassandra or ScyllaDB through Hecuba, the application code is greatly simplified. Thanks to the key-value data model, the number of synchronizations is reduced and the time dedicated to I/O scales when increasing the number of nodes. This research has received funding from the European Union's Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreement No. 720270 (Human Brain Project SGA1) and the Specific Grant Agreement No. 785907 (Human Brain Project SGA2). This work has also been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), and by the Generalitat de Catalunya (contract 2017-SGR-1414).
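Why a key-value model reduces synchronizations can be sketched abstractly: if each worker writes its segmented block under its own key, no shared-file locking is needed. The sketch below uses a plain dict as a stand-in for the distributed store; it illustrates the data model only, not Hecuba's actual code:

```python
class BlockStore:
    """Sketch of the key-value model described above: each worker
    writes its block under its own (image, block) key, so no
    cross-worker file locking or synchronisation is needed."""
    def __init__(self):
        self._table = {}   # stand-in for a Cassandra/ScyllaDB table

    def put_block(self, image_id, block_id, cells):
        self._table[(image_id, block_id)] = cells   # independent write

    def get_block(self, image_id, block_id):
        return self._table[(image_id, block_id)]

store = BlockStore()
# Two "workers" store their results with no coordination between them.
store.put_block("brain-slice-7", (0, 0), ["cell-a", "cell-b"])
store.put_block("brain-slice-7", (0, 1), ["cell-c"])
print(store.get_block("brain-slice-7", (0, 1)))  # ['cell-c']
```

Contrast this with a single shared HDF5 file, where concurrent writers must coordinate access to the same container.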
Aeneas: A tool to enable applications to effectively use non-relational databases
Non-relational databases arise as a solution to the scalability problems of relational databases when dealing with big data applications. However, they are highly configurable and prone to user decisions that can heavily affect their performance. In order to maximize performance, different data models and queries should be analyzed to choose the best fit. This may involve a wide range of tests and may result in productivity issues. We present Aeneas, a tool to support the design of data management code for applications using non-relational databases. Aeneas provides an easy and fast methodology to support decisions about how to organize and retrieve data in order to improve performance.
Exploiting key-value data stores scalability for HPC
Big Data revolutionised the IT industry. It first affected OLTP systems: distributed hash tables replaced traditional SQL databases, as they guaranteed low response times on simple read/write requests. The second wave recast data warehousing: map-reduce systems spread as they proved to scale long-running computational workloads linearly on commodity servers. The focus now is on real-time analytics: being able to analyse massive quantities of data in a short time enables multiple HPC applications as well as interactive analysis and visualization. In this paper, we study the performance of a system that employs the DHT architecture to achieve fast local analysis on indexed data. We observed that the number of keys, the number of nodes, and the hardware characteristics strongly influence the actual scalability of the system. Therefore, we developed a mathematical model that allows finding the right system configuration to meet the desired performance for each kind of query. We also show how our model can be used to find the right architecture for each distributed application. This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 720270 (HBP SGA1). It is also partially supported by grant SEV-2011-00067 of the Severo Ochoa Program awarded by the Spanish Government, the TIN2015-65316-P project, with funding from the Spanish Ministry of Economy and Competitiveness, the European Union FEDER funds, and the SGR 2014-SGR-1051.
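The abstract does not give the paper's fitted model, but the general shape of such a sizing model can be illustrated with assumed parameters: per-node coordination overhead grows with cluster size, while per-key work shrinks as keys are spread across nodes, so there is a smallest cluster meeting a response-time target:

```python
def query_time(keys, nodes, t_key=0.002, t_node=0.05):
    """Illustrative cost model (form and parameters are assumptions,
    not the paper's fitted model): fixed overhead per node, plus
    per-key work divided across nodes."""
    return t_node * nodes + t_key * keys / nodes

def smallest_cluster(keys, target, max_nodes=64):
    """Smallest node count whose predicted time meets the target."""
    for n in range(1, max_nodes + 1):
        if query_time(keys, n) <= target:
            return n
    return None   # target unreachable within max_nodes

# With these assumed costs, 100k keys under a 25 s budget need 9 nodes.
print(smallest_cluster(100_000, target=25.0))  # 9
```

This is the kind of "right system configuration for each query type" question the paper's model answers with measured, rather than assumed, parameters.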
Automatic query driven data modelling in Cassandra
Non-relational databases have recently been the preferred choice when it comes to dealing with Big Data challenges, but their performance is very sensitive to the chosen data organisation. We have seen differences of over 70 times in response time for the same query on different models. This forces users to be fully conscious of the queries they intend to serve in order to design their data model. The common practice, then, is to replicate data into different models designed to fit different query requirements. In this scenario, the user is in charge of the code implementation required to keep consistency between the different data replicas. Manually replicating data in such high layers of the database results in a lot of squandered storage, due to the underlying system replication mechanisms that were originally designed for availability and reliability purposes. We propose and design a mechanism and a prototype to provide users with transparent management, where queries are matched with a well-performing model option. Additionally, we propose to do so by transforming the replication mechanism into a heterogeneous one, in order to avoid squandering disk space while keeping the availability and reliability features. The result is a system where, regardless of the query or model the user specifies, the response time will always be that of an affine query.
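The matching idea can be sketched abstractly (names and structure are illustrative, not the prototype's code): each heterogeneous replica stores the same rows under a different partition key, and a query is routed to the replica whose key matches its filter column, turning a would-be scan into a single key lookup:

```python
class ModelRouter:
    """Sketch of transparent query-to-model matching: one logical
    write fans out to every model; each query is served by the
    model partitioned on its filter column."""
    def __init__(self):
        self.replicas = {}   # partition-key column -> {value: row}

    def write(self, row):
        # Heterogeneous replication: index the row once per model.
        for column in row:
            self.replicas.setdefault(column, {})[row[column]] = row

    def query(self, column, value):
        # Served by the model whose partition key is `column`:
        # a single key lookup instead of a full scan.
        return self.replicas[column].get(value)

r = ModelRouter()
r.write({"sensor": "s1", "day": "2024-01-01", "reading": 7})
print(r.query("day", "2024-01-01")["sensor"])   # s1
print(r.query("sensor", "s1")["reading"])       # 7
```

Replacing the store's homogeneous replicas with such differently-keyed copies is what lets the approach keep availability without paying twice for storage.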